While you follow this lab, you may want to open these cheat sheets:
Open a command line interface (e.g. Terminal or GitBash)
Change your working directory to a location where you will store all the materials for this lab
Use mkdir to create a directory lab05 for the lab materials
Use cd to change directory to (i.e. move inside) lab05
Create other subdirectories: data, report, images
Use ls to list the contents of lab05 and confirm that you have all the subdirectories.
Use touch to create an empty README.md text file
Use a text editor (e.g. the one in RStudio) to open the README.md file, and then add a brief description of today’s lab, using markdown syntax.
Change directory to the data/ folder.
Download the data file with the command curl, and the -O option (letter O)
Use ls to confirm that the csv file is in data/.
Use word count wc to count the lines of the csv file
Take a peek at the first rows of the csv file with head
Take a peek at the last 5 rows of the csv file with tail
cd Desktop
mkdir lab05
cd lab05
mkdir data
mkdir report
mkdir images
ls
touch README.md
# added brief description of lab 05 with markdown syntax
open README.md
cd data
curl -O https://raw.githubusercontent.com/ucb-stat133/stat133-fall-2018/master/data/nba2018-players.csv
ls
wc -l nba2018-players.csv
head -n 5 nba2018-players.csv
tail -n 5 nba2018-players.csv
I include this code for completeness, but the command only needs to be run once to install dplyr and ggplot2:
install.packages(c("dplyr", "ggplot2"))
About loading packages: Another rule to keep in mind is to always load any required packages at the very top of your script files (.R or .Rmd or .Rnw files). Avoid calling the library() function in the middle of a script. Instead, load all the packages before anything else.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
The other important specification to include in your Rmd file is a global chunk option to specify the location of plots and graphics. This is done by setting the fig.path argument inside the knitr::opts_chunk$set() function.
If you don’t specify fig.path, "knitr" will create a default directory to store all the plots produced when knitting an Rmd file. This time, however, we want to have more control over where things are placed. Because you already have a folder images/ as part of the file structure, this is where we want "knitr" to save all the generated graphics. Notice the use of a relative path fig.path = '../images/'. This is because your Rmd file should be inside the folder report/, but the folder images/ is outside report/ (i.e. in the same parent directory as report/). I did this part at the beginning of the Rmd file.
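As a minimal sketch of that global option, the setup chunk at the top of the Rmd file could contain something like the following. The echo = TRUE setting is an assumption added for illustration; only fig.path comes from the text above.

```r
# Global chunk options, placed in the first code chunk of the Rmd file.
# Assumes the Rmd file lives in report/ and images/ is a sibling folder.
# echo = TRUE is an illustrative extra, not something the lab prescribes.
knitr::opts_chunk$set(echo = TRUE, fig.path = '../images/')

# Check the value that later chunks will use
knitr::opts_chunk$get("fig.path")
```

With this in place, knitr builds each figure's file name from fig.path, the chunk label, and the figure number, which is one more reason to label every plotting chunk.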
The data file for this lab is nba2018-players.csv. To import the data in R you can use the base function read.csv(), or you can use read_csv() from the package "readr":
library(readr)
setwd("/Users/sharonhui/Desktop/lab05/data")
dat <- read_csv('nba2018-players.csv')
## Parsed with column specification:
## cols(
## player = col_character(),
## team = col_character(),
## position = col_character(),
## height = col_integer(),
## weight = col_integer(),
## age = col_integer(),
## experience = col_integer(),
## college = col_character(),
## salary = col_double(),
## games = col_integer(),
## minutes = col_integer(),
## points = col_integer(),
## points3 = col_integer(),
## points2 = col_integer(),
## points1 = col_integer()
## )
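For comparison, here is a small sketch of the base R alternative, read.csv(). It uses a tiny made-up CSV written to a temporary file so that it runs on its own; the column names and values are illustrative, not taken from the lab data.

```r
# Write a tiny made-up CSV to a temporary file so the example is self-contained
csv_file <- tempfile(fileext = ".csv")
writeLines(
  c("player,team,height", "Player A,GSW,75", "Player B,BOS,82"),
  csv_file
)

# Base R import; stringsAsFactors = FALSE keeps text columns as character
toy <- read.csv(csv_file, stringsAsFactors = FALSE)

# Inspect the column types that read.csv() inferred
str(toy)
```

For the lab file itself, the equivalent call would presumably be read.csv('nba2018-players.csv', stringsAsFactors = FALSE), run from the data/ folder.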
To make the learning process of “dplyr” gentler, Hadley Wickham proposes beginning with a set of five basic verbs or operations for data frames (each verb corresponds to a function in “dplyr”):
filter: keep rows matching criteria
select: pick columns by name
mutate: add new variables
arrange: reorder rows
summarise: reduce variables to values
Here is a slightly modified version of Hadley’s list of verbs:
filter(), slice(), and select(): subsetting and selecting rows and columns
mutate(): add new variables
arrange(): reorder rows
summarise(): reduce variables to values
group_by(): grouped (aggregate) operations
slice() allows you to select rows by position
filter() allows you to select rows by condition.
select() allows you to select columns by name
three_rows <- slice(dat, 1:3)
gt_85 <- filter(dat, height > 85)
player_height <- select(dat, player, height)
Your turn:
Use slice() to subset the data by selecting the first 5 rows.
slice(dat, 1:5)
## # A tibble: 5 x 15
## player team position height weight age experience
## <chr> <chr> <chr> <int> <int> <int> <int>
## 1 Al Horford BOS C 82 245 30 9
## 2 Amir Johnson BOS PF 81 240 29 11
## 3 Avery Bradley BOS SG 74 180 26 6
## 4 Demetrius Jackson BOS PG 73 201 22 0
## 5 Gerald Green BOS SF 79 205 31 9
## # ... with 8 more variables: college <chr>, salary <dbl>, games <int>,
## # minutes <int>, points <int>, points3 <int>, points2 <int>,
## # points1 <int>
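Use slice() to subset the data by selecting rows 10, 15, 20, …, 50.
# note: this index vector skips row 45, so only 8 rows are returned;
# seq(10, 50, by = 5) would select the full sequence
slice(dat, c(10, 15, 20, 25, 30, 35, 40, 50))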
## # A tibble: 8 x 15
## player team position height weight age experience
## <chr> <chr> <chr> <int> <int> <int> <int>
## 1 Jonas Jerebko BOS PF 82 231 29 6
## 2 Tyler Zeller BOS C 84 253 27 4
## 3 Derrick Williams CLE PF 80 240 25 5
## 4 Jordan McRae CLE SG 78 185 25 1
## 5 Larry Sanders CLE C 83 235 28 5
## 6 Cory Joseph TOR PG 75 193 25 5
## 7 Jakob Poeltl TOR C 84 248 21 0
## 8 Bradley Beal WAS SG 77 207 23 4
## # ... with 8 more variables: college <chr>, salary <dbl>, games <int>,
## # minutes <int>, points <int>, points3 <int>, points2 <int>,
## # points1 <int>
Use slice() to subset the data by selecting the last 5 rows.
slice(dat, (nrow(dat) - 4):nrow(dat))
## # A tibble: 5 x 15
## player team position height weight age experience
## <chr> <chr> <chr> <int> <int> <int> <int>
## 1 Marquese Chriss PHO PF 82 233 19 0
## 2 Ronnie Price PHO PG 74 190 33 11
## 3 T.J. Warren PHO SF 80 230 23 2
## 4 Tyler Ulis PHO PG 70 150 21 0
## 5 Tyson Chandler PHO C 85 240 34 15
## # ... with 8 more variables: college <chr>, salary <dbl>, games <int>,
## # minutes <int>, points <int>, points3 <int>, points2 <int>,
## # points1 <int>
# Another way to do this
tail(dat, 5)
## # A tibble: 5 x 15
## player team position height weight age experience
## <chr> <chr> <chr> <int> <int> <int> <int>
## 1 Marquese Chriss PHO PF 82 233 19 0
## 2 Ronnie Price PHO PG 74 190 33 11
## 3 T.J. Warren PHO SF 80 230 23 2
## 4 Tyler Ulis PHO PG 70 150 21 0
## 5 Tyson Chandler PHO C 85 240 34 15
## # ... with 8 more variables: college <chr>, salary <dbl>, games <int>,
## # minutes <int>, points <int>, points3 <int>, points2 <int>,
## # points1 <int>
Use filter() to subset those players with height less than 70 inches tall.
filter(dat, height < 70)
## # A tibble: 2 x 15
## player team position height weight age experience
## <chr> <chr> <chr> <int> <int> <int> <int>
## 1 Isaiah Thomas BOS PG 69 185 27 5
## 2 Kay Felder CLE PG 69 176 21 0
## # ... with 8 more variables: college <chr>, salary <dbl>, games <int>,
## # minutes <int>, points <int>, points3 <int>, points2 <int>,
## # points1 <int>
Use filter() to subset rows of Golden State Warriors (‘GSW’).
filter(dat, team == "GSW")
## # A tibble: 16 x 15
## player team position height weight age experience
## <chr> <chr> <chr> <int> <int> <int> <int>
## 1 Anderson Varejao GSW C 82 273 34 12
## 2 Andre Iguodala GSW SF 78 215 33 12
## 3 Damian Jones GSW C 84 245 21 0
## 4 David West GSW C 81 250 36 13
## 5 Draymond Green GSW PF 79 230 26 4
## 6 Ian Clark GSW SG 75 175 25 3
## 7 James Michael McAdoo GSW PF 81 230 24 2
## 8 JaVale McGee GSW C 84 270 29 8
## 9 Kevin Durant GSW PF 81 240 28 9
## 10 Kevon Looney GSW C 81 220 20 1
## 11 Klay Thompson GSW SG 79 215 26 5
## 12 Matt Barnes GSW SF 79 226 36 13
## 13 Patrick McCaw GSW SG 79 185 21 0
## 14 Shaun Livingston GSW PG 79 192 31 11
## 15 Stephen Curry GSW PG 75 190 28 7
## 16 Zaza Pachulia GSW C 83 270 32 13
## # ... with 8 more variables: college <chr>, salary <dbl>, games <int>,
## # minutes <int>, points <int>, points3 <int>, points2 <int>,
## # points1 <int>
Use filter() to subset rows of GSW centers (‘C’).
filter(dat, team == "GSW" & position == "C")
## # A tibble: 6 x 15
## player team position height weight age experience
## <chr> <chr> <chr> <int> <int> <int> <int>
## 1 Anderson Varejao GSW C 82 273 34 12
## 2 Damian Jones GSW C 84 245 21 0
## 3 David West GSW C 81 250 36 13
## 4 JaVale McGee GSW C 84 270 29 8
## 5 Kevon Looney GSW C 81 220 20 1
## 6 Zaza Pachulia GSW C 83 270 32 13
## # ... with 8 more variables: college <chr>, salary <dbl>, games <int>,
## # minutes <int>, points <int>, points3 <int>, points2 <int>,
## # points1 <int>
Use filter() and then select() to subset rows of the Lakers (‘LAL’), and then display their names.
dat %>%
  filter(team == "LAL") %>%
  select(player)
## # A tibble: 14 x 1
## player
## <chr>
## 1 Brandon Ingram
## 2 Corey Brewer
## 3 D'Angelo Russell
## 4 David Nwaba
## 5 Ivica Zubac
## 6 Jordan Clarkson
## 7 Julius Randle
## 8 Luol Deng
## 9 Metta World Peace
## 10 Nick Young
## 11 Tarik Black
## 12 Thomas Robinson
## 13 Timofey Mozgov
## 14 Tyler Ennis
Use filter() and then select() to display the name and salary of GSW point guards.
dat %>%
filter(team == "GSW" & position == "PG") %>%
select(player, salary)
## # A tibble: 2 x 2
## player salary
## <chr> <dbl>
## 1 Shaun Livingston 5782450
## 2 Stephen Curry 12112359
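# name, age, and team of players with more than 10 years of experience
# making 10 million dollars or less
dat %>%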
filter(experience > 10 & salary <= 10000000) %>%
select(player, age, team)
## # A tibble: 36 x 3
## player age team
## <chr> <int> <chr>
## 1 Andrew Bogut 32 CLE
## 2 Dahntay Jones 36 CLE
## 3 Deron Williams 32 CLE
## 4 James Jones 36 CLE
## 5 Kyle Korver 35 CLE
## 6 Richard Jefferson 36 CLE
## 7 Jose Calderon 35 ATL
## 8 Kris Humphries 31 ATL
## 9 Mike Dunleavy 36 ATL
## 10 Jason Terry 39 MIL
## # ... with 26 more rows
# name, team, height, and weight of the first five 20-year-old rookies (experience 0)
dat %>%
  filter(experience == 0 & age == 20) %>%
  select(player, team, height, weight) %>%
  head(5)
## # A tibble: 5 x 4
## player team height weight
## <chr> <chr> <int> <int>
## 1 Jaylen Brown BOS 79 225
## 2 Henry Ellenson DET 83 245
## 3 Stephen Zimmerman ORL 84 240
## 4 Dejounte Murray SAS 77 170
## 5 Chinanu Onuaku HOU 82 245
Another basic verb is mutate() which allows you to add new variables. Let’s create a small data frame for the warriors with three columns: player, height, and weight:
# creating a small data frame step by step
gsw <- filter(dat, team == 'GSW')
gsw <- select(gsw, player, height, weight)
gsw <- slice(gsw, c(4, 8, 10, 14, 15))
gsw
## # A tibble: 5 x 3
## player height weight
## <chr> <int> <int>
## 1 David West 81 250
## 2 JaVale McGee 84 270
## 3 Kevon Looney 81 220
## 4 Shaun Livingston 79 192
## 5 Stephen Curry 75 190
Now, let’s use mutate() to (temporarily) add a column with the ratio height / weight:
mutate(gsw, height / weight)
## # A tibble: 5 x 4
## player height weight `height/weight`
## <chr> <int> <int> <dbl>
## 1 David West 81 250 0.3240000
## 2 JaVale McGee 84 270 0.3111111
## 3 Kevon Looney 81 220 0.3681818
## 4 Shaun Livingston 79 192 0.4114583
## 5 Stephen Curry 75 190 0.3947368
You can also give the new column a name, like ht_wt = height / weight:
mutate(gsw, ht_wt = height / weight)
## # A tibble: 5 x 4
## player height weight ht_wt
## <chr> <int> <int> <dbl>
## 1 David West 81 250 0.3240000
## 2 JaVale McGee 84 270 0.3111111
## 3 Kevon Looney 81 220 0.3681818
## 4 Shaun Livingston 79 192 0.4114583
## 5 Stephen Curry 75 190 0.3947368
In order to permanently change the data, you need to assign the changes to an object:
gsw2 <- mutate(gsw, ht_m = height * 0.0254, wt_kg = weight * 0.4536)
gsw2
## # A tibble: 5 x 5
## player height weight ht_m wt_kg
## <chr> <int> <int> <dbl> <dbl>
## 1 David West 81 250 2.0574 113.4000
## 2 JaVale McGee 84 270 2.1336 122.4720
## 3 Kevon Looney 81 220 2.0574 99.7920
## 4 Shaun Livingston 79 192 2.0066 87.0912
## 5 Stephen Curry 75 190 1.9050 86.1840
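To make the temporary versus permanent distinction concrete, here is a small sketch on a made-up data frame. The object names toy and toy2 and their values are hypothetical.

```r
library(dplyr)

# Made-up data frame, for illustration only
toy <- data.frame(
  player = c("A", "B"),
  height = c(75, 81),
  weight = c(190, 250)
)

# mutate() returns a new data frame; toy itself is left untouched
mutate(toy, ht_wt = height / weight)
names(toy)

# Assigning the result to an object keeps the new column
toy2 <- mutate(toy, ht_wt = height / weight)
names(toy2)
```

The first names() call still lists only the three original columns, while toy2 carries the extra ht_wt column.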
The next basic verb of “dplyr” is arrange() which allows you to reorder rows. For example, here’s how to arrange the rows of gsw by height:
arrange(gsw, height)
## # A tibble: 5 x 3
## player height weight
## <chr> <int> <int>
## 1 Stephen Curry 75 190
## 2 Shaun Livingston 79 192
## 3 David West 81 250
## 4 Kevon Looney 81 220
## 5 JaVale McGee 84 270
By default arrange() sorts rows in increasing order. To arrange rows in descending order you need to use the auxiliary function desc().
arrange(gsw, desc(height))
## # A tibble: 5 x 3
## player height weight
## <chr> <int> <int>
## 1 JaVale McGee 84 270
## 2 David West 81 250
## 3 Kevon Looney 81 220
## 4 Shaun Livingston 79 192
## 5 Stephen Curry 75 190
# order rows by height, and then weight
arrange(gsw, height, weight)
## # A tibble: 5 x 3
## player height weight
## <chr> <int> <int>
## 1 Stephen Curry 75 190
## 2 Shaun Livingston 79 192
## 3 Kevon Looney 81 220
## 4 David West 81 250
## 5 JaVale McGee 84 270
Using the data frame gsw, add a new variable product with the product of height and weight.
mutate(gsw, product = height * weight)
## # A tibble: 5 x 4
## player height weight product
## <chr> <int> <int> <int>
## 1 David West 81 250 20250
## 2 JaVale McGee 84 270 22680
## 3 Kevon Looney 81 220 17820
## 4 Shaun Livingston 79 192 15168
## 5 Stephen Curry 75 190 14250
Create a new data frame gsw3, by adding columns log_height and log_weight with the log transformations of height and weight.
gsw3 <- mutate(gsw, log_height = log(height), log_weight = log(weight))
Use filter() and arrange() to display those players with height less than 71 inches tall, in increasing order of height.
new_dat <- filter(dat, height < 71)
arrange(new_dat, height)
## # A tibble: 4 x 15
## player team position height weight age experience
## <chr> <chr> <chr> <int> <int> <int> <int>
## 1 Isaiah Thomas BOS PG 69 185 27 5
## 2 Kay Felder CLE PG 69 176 21 0
## 3 Pierre Jackson DAL PG 70 180 25 0
## 4 Tyler Ulis PHO PG 70 150 21 0
## # ... with 8 more variables: college <chr>, salary <dbl>, games <int>,
## # minutes <int>, points <int>, points3 <int>, points2 <int>,
## # points1 <int>
# name, team, and salary of the five highest-paid players
head(select(arrange(dat, desc(salary)), player, team, salary), 5)
## # A tibble: 5 x 3
## player team salary
## <chr> <chr> <dbl>
## 1 LeBron James CLE 30963450
## 2 Al Horford BOS 26540100
## 3 DeMar DeRozan TOR 26540100
## 4 Kevin Durant GSW 26540100
## 5 James Harden HOU 26540100
# name, team, and points3 of the ten players with the most three-pointers
head(select(arrange(dat, desc(points3)), player, team, points3), 10)
## # A tibble: 10 x 3
## player team points3
## <chr> <chr> <int>
## 1 Stephen Curry GSW 324
## 2 Klay Thompson GSW 268
## 3 James Harden HOU 262
## 4 Eric Gordon HOU 246
## 5 Isaiah Thomas BOS 245
## 6 Kemba Walker CHO 240
## 7 Bradley Beal WAS 223
## 8 Damian Lillard POR 214
## 9 Ryan Anderson HOU 204
## 10 J.J. Redick LAC 201
# Another way
slice(select(arrange(dat, desc(points3)), player, team, points3), 1:10)
## # A tibble: 10 x 3
## player team points3
## <chr> <chr> <int>
## 1 Stephen Curry GSW 324
## 2 Klay Thompson GSW 268
## 3 James Harden HOU 262
## 4 Eric Gordon HOU 246
## 5 Isaiah Thomas BOS 245
## 6 Kemba Walker CHO 240
## 7 Bradley Beal WAS 223
## 8 Damian Lillard POR 214
## 9 Ryan Anderson HOU 204
## 10 J.J. Redick LAC 201
Create a data frame gsw_mpg of GSW players that contains variables for player name, experience, and min_per_game (minutes per game), sorted by min_per_game (in descending order).
gsw_mpg <- dat %>%
  filter(team == "GSW") %>%
  mutate(min_per_game = minutes / games) %>%
  select(player, experience, min_per_game) %>%
  arrange(desc(min_per_game))
# print as a plain data frame so that all rows are displayed
data.frame(gsw_mpg)
## player experience min_per_game
## 1 Klay Thompson 5 33.961538
## 2 Stephen Curry 7 33.392405
## 3 Kevin Durant 9 33.387097
## 4 Draymond Green 4 32.513158
## 5 Andre Iguodala 12 26.289474
## 6 Matt Barnes 13 20.500000
## 7 Zaza Pachulia 13 18.114286
## 8 Shaun Livingston 11 17.697368
## 9 Patrick McCaw 0 15.126761
## 10 Ian Clark 3 14.766234
## 11 David West 13 12.558824
## 12 JaVale McGee 8 9.597403
## 13 James Michael McAdoo 2 8.788462
## 14 Damian Jones 0 8.500000
## 15 Kevon Looney 1 8.433962
## 16 Anderson Varejao 12 6.571429
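A pipeline like this one can be written either as nested function calls or as a chain with the %>% operator. The sketch below, on a small made-up data frame, suggests that both forms give the same result; the data values are hypothetical.

```r
library(dplyr)

# Made-up data frame, for illustration only
toy <- data.frame(
  player = c("A", "B", "C"),
  team = c("GSW", "GSW", "BOS"),
  minutes = c(300, 100, 200),
  games = c(10, 10, 10)
)

# Nested calls: read from the innermost function outwards
nested <- arrange(
  select(
    mutate(filter(toy, team == "GSW"), min_per_game = minutes / games),
    player, min_per_game
  ),
  desc(min_per_game)
)

# Piped calls: read from top to bottom, one step per line
piped <- toy %>%
  filter(team == "GSW") %>%
  mutate(min_per_game = minutes / games) %>%
  select(player, min_per_game) %>%
  arrange(desc(min_per_game))

# Compare the two results
all.equal(nested, piped)
```

The piped form tends to be easier to follow once a computation has more than two or three steps.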
The next verb is summarise(). Conceptually, this involves applying a function to one or more columns, in order to summarize values. This is probably easier to understand with an example.
Say you are interested in calculating the average salary of all NBA players. To do this “a la dplyr” you use summarise(), or its synonym function summarize():
# average salary of NBA players
summarise(dat, avg_salary = mean(salary))
## # A tibble: 1 x 1
## avg_salary
## <dbl>
## 1 5804697
Calculating an average like this seems a bit verbose, especially when you can directly use mean() like this:
mean(dat$salary)
## [1] 5804697
What if you want to calculate some summary statistics for salary: min, median, mean, and max?
# some stats for salary (dplyr)
summarise(
dat,
min = min(salary),
median = median(salary),
avg = mean(salary),
max = max(salary)
)
## # A tibble: 1 x 4
## min median avg max
## <dbl> <dbl> <dbl> <dbl>
## 1 5145 3e+06 5804697 30963450
Well, this may still not look like much. You can do the same in base R (there are actually better ways to do this):
# some stats for salary (base R)
c(min = min(dat$salary), median = median(dat$salary), avg = mean(dat$salary), max = max(dat$salary))
## min median avg max
## 5145 3000000 5804697 30963450
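The text does not say which "better ways" it has in mind. One plausible candidate in base R (an assumption on my part) is summary(), with quantile() as a more flexible option. The sketch below uses a made-up salary vector.

```r
# Made-up salaries, for illustration only
salary <- c(5145, 1000000, 3000000, 8000000, 30963450)

# summary() reports min, quartiles, median, mean, and max in one call
salary_summary <- summary(salary)
salary_summary

# quantile() returns chosen percentiles; 0, 0.5, and 1 are min, median, max
salary_quantiles <- quantile(salary, probs = c(0, 0.5, 1))
salary_quantiles
```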
To actually appreciate the power of summarise(), we need to introduce the other major basic verb in “dplyr”: group_by(). This is the function that allows you to perform data aggregations, or grouped operations.
Let’s see the combination of summarise() and group_by() to calculate the average salary by team:
# average salary, grouped by team
summarise( group_by(dat, team), avg_salary = mean(salary) )
## # A tibble: 30 x 2
## team avg_salary
## <chr> <dbl>
## 1 ATL 5494447
## 2 BOS 6127673
## 3 BRK 4011351
## 4 CHI 5781368
## 5 CHO 5531548
## 6 CLE 7069699
## 7 DAL 5157128
## 8 DEN 4648719
## 9 DET 6871632
## 10 GSW 6265160
## # ... with 20 more rows
# average salary, grouped by position
summarise(
group_by(dat, position),
avg_salary = mean(salary)
)
## # A tibble: 5 x 2
## position avg_salary
## <chr> <dbl>
## 1 C 6529906
## 2 PF 5801127
## 3 PG 5601217
## 4 SF 6042455
## 5 SG 5114178
# average weight and height, by position, displayed in descending order by average height
arrange(
summarise(
group_by(dat, position),
avg_height = mean(height),
avg_weight = mean(weight)),
desc(avg_height)
)
## # A tibble: 5 x 3
## position avg_height avg_weight
## <chr> <dbl> <dbl>
## 1 C 83.21649 251.1031
## 2 PF 81.40816 235.2857
## 3 SF 79.52381 220.2381
## 4 SG 77.04902 204.3431
## 5 PG 74.32292 188.9583
use summarise() to get the largest height value.
summarise(dat, largest_height_value = max(height))
## # A tibble: 1 x 1
## largest_height_value
## <dbl>
## 1 87
use summarise() to get the standard deviation of points3.
summarise(dat, standard_deviation_of_points3 = sd(points3))
## # A tibble: 1 x 1
## standard_deviation_of_points3
## <dbl>
## 1 55.11807
use summarise() and group_by() to display the median of three-points, by team.
summarise(
group_by(dat, team),
median_points3 = median(points3)
)
## # A tibble: 30 x 2
## team median_points3
## <chr> <dbl>
## 1 ATL 32.0
## 2 BOS 46.0
## 3 BRK 36.0
## 4 CHI 28.5
## 5 CHO 13.0
## 6 CLE 26.5
## 7 DAL 18.0
## 8 DEN 46.0
## 9 DET 28.0
## 10 GSW 10.5
## # ... with 20 more rows
display the average triple points by team, in ascending order, of the bottom-5 teams (worst 3pointer teams).
tail(arrange((summarise(
group_by(dat, team),
average_points3 = mean(points3)
)), desc(average_points3)), 5)
## # A tibble: 5 x 2
## team average_points3
## <chr> <dbl>
## 1 CHI 35.31250
## 2 SAC 35.12500
## 3 ORL 34.33333
## 4 PHO 33.47059
## 5 NOP 32.43750
# the same five teams, displayed in ascending order
dat %>%
  group_by(team) %>%
  summarise(average_points3 = mean(points3)) %>%
  arrange(average_points3) %>%
  head(5)
## # A tibble: 5 x 2
## team average_points3
## <chr> <dbl>
## 1 NOP 32.43750
## 2 PHO 33.47059
## 3 ORL 34.33333
## 4 SAC 35.12500
## 5 CHI 35.31250
obtain the mean and standard deviation of age, for Power Forwards with between 5 and 10 years (inclusive) of experience.
summarise(
  filter(dat, position == "PF", experience >= 5 & experience <= 10),
  mean_power_forwards = mean(age),
  sd_power_forwards = sd(age)
)
## # A tibble: 1 x 2
## mean_power_forwards sd_power_forwards
## <dbl> <dbl>
## 1 28.43243 2.267408
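An inclusive range condition like this can also be expressed with dplyr's between() helper. The following sketch, on a made-up data frame with hypothetical values, compares the two spellings.

```r
library(dplyr)

# Made-up data frame, for illustration only
toy <- data.frame(
  position = c("PF", "PF", "PF", "C"),
  experience = c(4, 5, 10, 7),
  age = c(24, 27, 31, 29)
)

# Range written with two comparisons
with_ops <- filter(toy, position == "PF", experience >= 5 & experience <= 10)

# Same range written with between(), which includes both end points
with_between <- filter(toy, position == "PF", between(experience, 5, 10))

# Compare the two results, then summarise one of them
all.equal(with_ops, with_between)
summarise(with_between, mean_age = mean(age), sd_age = sd(age))
```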
The main function in “ggplot2” is ggplot()
The main input to ggplot() is a data frame object.
You can use the internal function aes() to specify what columns of the data frame will be used for the graphical elements of the plot.
You must specify what kind of geometric objects or geoms will be displayed: e.g. geom_point(), geom_bar(), geom_boxplot().
Pretty much anything else that you want to add to your plot is controlled by auxiliary functions, especially those things that have to do with the format, rather than the underlying data.
The construction of a ggplot is done by adding layers with the + operator.
When including code for plots and graphics, we strongly recommend that you create an individual code chunk for each plot, and that you give a label to that chunk.
# scatterplot (option 1)
ggplot(data = dat) +
geom_point(aes(x = points, y = salary))
ggplot() creates an object of class “ggplot”
the main input for ggplot() is data which must be a data frame
then we use the "+" operator to add a layer
the geometric objects (geoms) are points: geom_point()
aes() is used to specify the x and y coordinates, by taking columns points and salary from the data frame
# scatterplot (option 2)
ggplot(data = dat, aes(x = points, y = salary)) +
geom_point()
# Say you want to color code the points in terms of position
# colored scatterplot
ggplot(data = dat, aes(x = points, y = salary)) +
geom_point(aes(color = position))
# Maybe you want to modify the size of the dots in terms of points3:
# sized and colored scatterplot
ggplot(data = dat, aes(x = points, y = salary)) +
geom_point(aes(color = position, size = points3))
# To add some transparency effect to the dots, you can use the alpha parameter.
# sized and colored scatterplot
ggplot(data = dat, aes(x = points, y = salary)) +
geom_point(aes(color = position, size = points3), alpha = 0.7)
Notice that alpha was specified outside aes(). This is because we are not using any column for the alpha transparency values.
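The difference between mapping an aesthetic to a column and setting it to a fixed value can be sketched as follows. The data frame toy and the color "blue" are made up for illustration.

```r
library(ggplot2)

# Made-up data frame, for illustration only
toy <- data.frame(
  points = c(100, 500, 900),
  salary = c(1e6, 5e6, 2e7),
  position = c("C", "PG", "SF")
)

# Mapped: color varies with a column, so it goes inside aes()
p_mapped <- ggplot(toy, aes(x = points, y = salary)) +
  geom_point(aes(color = position))

# Set: one fixed value for every point, so it goes outside aes()
p_set <- ggplot(toy, aes(x = points, y = salary)) +
  geom_point(color = "blue", alpha = 0.7)
```

Printing p_mapped draws a legend for position, whereas p_set draws every point in the same semi-transparent blue with no legend.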
Open the ggplot2 cheatsheet
Use the data frame gsw to make a scatterplot of height and weight.
ggplot(data = gsw, aes(x = height, y = weight)) + geom_point()
Find out how to make another scatterplot of height and weight, using geom_text() to display the names of the players.
ggplot(data = gsw, aes(x = height, y = weight)) +
geom_text(aes(label = player))
Get a scatter plot of height and weight, for ALL the warriors, displaying their names with geom_label().
ggplot(data = filter(dat, team == "GSW"), aes(x = height, y = weight)) +
geom_point() +
geom_label(aes(label = player))
Get a density plot of salary (for all NBA players).
ggplot(data = dat, aes(x = salary)) +
  geom_density()
Get a histogram of points2 with binwidth of 50 (for all NBA players).
ggplot(data = dat, aes(x = points2)) +
  geom_histogram(binwidth = 50)
Get a barchart of the position frequencies (for all NBA players).
ggplot(data = dat, aes( x= position)) +
geom_bar()
Make a scatterplot of experience and salary of all Centers, and use geom_smooth() to add a regression line.
ggplot(data = filter(dat, position == "C"), aes(x = experience, y = salary)) +
  geom_point(size = .2) +
  geom_smooth(method = lm)
Repeat the same scatterplot of experience and salary of all Centers, but now use geom_smooth() to add a loess line (i.e. smooth line).
ggplot(data = filter(dat, position == "C"), aes(x = experience, y = salary)) +
  geom_point(size = .2) +
  geom_smooth(method = loess)
One of the most attractive features of “ggplot2” is the ability to display multiple facets. The idea of facets is to divide a plot into subplots based on the values of one or more categorical (or discrete) variables.
Here’s an example. What if you want to get scatterplots of points and salary separated (or grouped) by position? This is where faceting comes in handy, and you can use facet_wrap() for this purpose:
# scatterplot by position
ggplot(data = dat, aes(x = points, y = salary)) +
geom_point() +
facet_wrap(~ position)
The other faceting function is facet_grid(), which allows you to control the layout of the facets (by rows, by columns, etc)
# scatterplot by position
ggplot(data = dat, aes(x = points, y = salary)) +
geom_point(aes(color = position), alpha = 0.7) +
facet_grid(~ position) +
geom_smooth(method = loess)
# scatterplot by position
ggplot(data = dat, aes(x = points, y = salary)) +
geom_point(aes(color = position), alpha = 0.7) +
facet_grid(position ~ .) +
geom_smooth(method = loess)
Make scatterplots of experience and salary faceting by position.
# scatterplot by position
ggplot(data = dat, aes(x = experience, y = salary)) +
geom_point(aes(color = position), alpha = 0.7) +
facet_grid(~ position) +
geom_smooth(method = loess)
Make scatterplots of experience and salary faceting by team.
# scatterplot by team
ggplot(data = dat, aes(x = experience, y = salary)) +
  geom_point(aes(color = team), alpha = 0.7) +
  facet_wrap(~ team) +
  geom_smooth(method = loess)
Make density plots of age faceting by team.
ggplot(data = dat, aes(x = age)) +
  geom_density() +
  facet_wrap(~ team)
Make scatterplots of height and weight faceting by position.
ggplot(data = dat, aes(x = height, y = weight)) +
  geom_point(size = .5) +
  facet_wrap(~ position)
Make scatterplots of height and weight, with a 2-dimensional density, geom_density2d(), faceting by position.
# suppress warnings for the rest of the session
options(warn = -1)
# scatterplot by position
ggplot(data = dat, aes(x = height, y = weight)) +
  geom_point(aes(color = position), alpha = 0.7) +
  facet_wrap(~ position) +
  geom_smooth(method = loess) +
  geom_density2d()
Make a scatterplot of experience and salary for the Warriors, but this time add a layer with theme_bw() to get a simpler background.
ggplot(data = filter(dat, team == "GSW"), aes(x = experience, y = salary)) +
  geom_point(size = .5) +
  theme_bw()
Repeat the plot with another theme, e.g. theme_minimal(), theme_dark(), theme_classic().
ggplot(data = filter(dat, team == "GSW"), aes(x = experience, y = salary)) +
  geom_point(size = .5) +
  theme_classic()
Open the terminal.
Move inside the images/ directory of the lab.
List the contents of this directory.
Now list the contents of the directory in long format.
How would you list the contents in long format, by time?
How would you list the contents, displaying the results in reverse (alphabetical) order?
Without changing your current directory, create a directory copies at the parent level (i.e. lab05/).
Copy one of the PNG files to the copies folder.
Use the wildcard * to copy all the .png files into the directory copies.
Change to the directory copies. Use the command mv to rename some of your PNG files.
Change to the report/ directory.
From within report/, find out how to rename the directory copies as copy-files.
From within report/, delete one or two PNG files in copy-files.
From within report/, find out how to delete the directory copy-files.
cd Desktop
cd lab05
cd images/
ls
ls -l
ls -l -t
ls -r -l
mkdir ../copies
cp scatterplotwithfacetgrid2-1.png ../copies
cp *.png ../copies
cd ..
cd copies
mv scatterplotwithfacetgrid2-1.png scatterplotwithfacetgrid2.png
mv scatterplotwithfacetgrid1-1.png scatterplotwithafacetgridaboutposition.png
mv scatterplotwithgeom_label-1.png scatterplotgeom_labelheightweight.png
mv scatterplotofexperiencesalaryposition-1.png scatterplotexpsalpos.png
cd ..
cd report/
mv ../copies ../copy-files
rm ../copy-files/repeatplotwiththeme_classic-1.png
rm ../copy-files/densityplotofageteam-1.png
rm -R ../copy-files